Abstract

Motivated by the task of hyperparameter optimization, we introduce the
non-stochastic best-arm identification problem. Within the multi-armed
bandit literature, the cumulative regret objective enjoys algorithms and
analyses for both the non-stochastic and stochastic settings while, to
the best of our knowledge, the best-arm identification framework has only
been considered in the stochastic setting. We introduce the
non-stochastic setting under this framework, identify a known algorithm
that is well-suited for this setting, and analyze its behavior. Next, by
leveraging the iterative nature of standard machine learning algorithms,
we cast hyperparameter optimization as an instance of non-stochastic
best-arm identification, and empirically evaluate our proposed algorithm
on this task. Our empirical results show that, by allocating more
resources to promising hyperparameter settings, we typically achieve
comparable test accuracies an order of magnitude faster than baseline
methods.


learning models, resulting in a sequence of losses that eventually
converges to a final loss value. For example, Figure 1 shows the sequence
of validation losses for various hyperparameter settings for kernel SVM
models trained via stochastic gradient descent. The figure shows high
variability in model quality across hyperparameter settings. It thus
seems natural to ask the question: Can we terminate these poor-performing
hyperparameter settings early in a principled online fashion to speed up
hyperparameter optimization?



1. Introduction 


Figure 1. Validation error for different hyperparameter choices for 
a classification task trained using stochastic gradient descent. 


As supervised learning methods are becoming more widely adopted,
hyperparameter optimization has become increasingly important to simplify
and speed up the development of data processing pipelines while
simultaneously yielding more accurate models. In hyperparameter
optimization for supervised learning, we are given labeled training data,
a set of hyperparameters associated with our supervised learning methods
of interest, and a search space over these hyperparameters. We aim to
find a particular configuration of hyperparameters that optimizes some
evaluation criterion, e.g., loss on a validation dataset.

Since many machine learning algorithms are iterative in nature,
particularly when working at scale, we can evaluate the quality of
intermediate results, i.e., partially trained


Although several hyperparameter optimization methods have been proposed
recently, e.g., Snoek et al. (2012; 2014); Hutter et al. (2011); Bergstra
et al. (2011); Bergstra & Bengio (2012), the vast majority of them
consider the training of machine learning models to be black-box
procedures, and only evaluate models after they are fully trained to
convergence. A few recent works have made attempts to exploit
intermediate results. However, these works either require explicit forms
for the convergence rate behavior of the iterates, which is difficult to
accurately characterize for all but the simplest cases (Agarwal et al.,
2012; Swersky et al., 2014), or focus on heuristics lacking theoretical
underpinnings (Sparks et al., 2015). We build upon these previous works,
and in particular study the multi-armed bandit formulation proposed in
Agarwal et al. (2012) and Sparks et al. (2015), where each arm
corresponds to a fixed hyperparameter setting, pulling an arm corresponds
to a fixed number of training iterations, and the loss corresponds to an
intermediate loss on some hold-out set.


We aim to provide a robust, general-purpose, and widely applicable
bandit-based solution to hyperparameter optimization. Remarkably,
however, the existing multi-armed bandits literature fails to address
this natural problem setting: a non-stochastic best-arm identification
problem. While multi-armed bandits is a thriving area of research, we
believe that the existing work fails to adequately address the two main
challenges in this setting:


1. We know each arm's sequence of losses eventually converges, but we
have no information about the rate of convergence, and the sequence of
losses, like those in Figure 1, may exhibit a high degree of
non-monotonicity and non-smoothness.

2. Obtaining the loss of an arm can be disproportionately more costly
than pulling it. For example, in the case of hyperparameter optimization,
computing the validation loss is often drastically more expensive than
performing a single training iteration.


2. Non-stochastic best arm identification 


Objective functions for multi-armed bandits problems tend to take on one
of two flavors: 1) best arm identification (or pure exploration), in
which one is interested in identifying the arm with the highest average
payoff, and 2) exploration-versus-exploitation, in which we are trying to
maximize the cumulative payoff over time (Bubeck & Cesa-Bianchi, 2012).
While the latter has been analyzed in both the stochastic and
non-stochastic settings, we are unaware of any work that addresses the
best arm objective in the non-stochastic setting, which is our setting of
interest. Moreover, while related, a strategy that is well-suited for
maximizing cumulative payoff is not necessarily well-suited for the
best-arm identification task, even in the stochastic setting (Bubeck et
al., 2009).


Best Arm Problem for Multi-armed Bandits

input: $n$ arms where $\ell_{i,k}$ denotes the loss observed on the $k$th
    pull of the $i$th arm
initialize: $T_i = 1$ for all $i \in [n]$
for $t = 1, 2, 3, \dots$
    Algorithm chooses an index $I_t \in [n]$
    Loss $\ell_{I_t, T_{I_t}}$ is revealed, $T_{I_t} = T_{I_t} + 1$
    Algorithm outputs a recommendation $J_t \in [n]$
    Receive external stopping signal, otherwise continue


We thus study this novel bandit setting, which encompasses the
hyperparameter optimization problem, and analyze an algorithm we identify
as being particularly well-suited for this setting. Moreover, we confirm
our theory with empirical studies that demonstrate order-of-magnitude
speedups relative to standard baselines on a number of real-world
supervised learning problems and datasets.

We note that this bandit setting is quite generally applicable. While the
problem of hyperparameter optimization inspired this work, the setting
itself encompasses the stochastic best-arm identification problem (Bubeck
et al., 2009), less-well-behaved stochastic sources like max-bandits
(Cicirello & Smith, 2005), exhaustive subset selection for feature
extraction, and many optimization problems that "feel" like stochastic
best-arm problems but lack the i.i.d. assumptions necessary in that
setting.

The remainder of the paper is organized as follows: In Section 2 we
present the setting of interest, provide a survey of related work, and
explain why most existing algorithms and analyses are not well-suited or
applicable for our setting. We then study our proposed algorithm in
Section 3 in our setting of interest, and analyze its performance
relative to a natural baseline. We then relate these results to the
problem of hyperparameter optimization in Section 4 and present our
experimental results in Section 5.


Figure 2. A generalization of the best arm problem for multi-armed
bandits (Bubeck et al., 2009) that applies to both the stochastic and
non-stochastic settings.

The algorithm of Figure 2 presents a general form of the best arm problem
for multi-armed bandits. Intuitively, at each time $t$ the goal is to
choose $J_t$ such that the arm associated with $J_t$ has the lowest loss
in some sense. Note that while the algorithm gets to observe the value
for an arbitrary arm $I_t$, the algorithm is only evaluated on its
recommendation $J_t$, which it also chooses arbitrarily. This is in
contrast to the exploration-versus-exploitation game where the arm that
is played is also the arm that the algorithm is evaluated on, namely,
$I_t$.
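The protocol of Figure 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function and strategy names (`play_best_arm_game`, `round_robin`, `lowest_latest`) and the toy loss table are hypothetical, pull counts start at 0 rather than 1, and the external stopping signal is replaced by a fixed number of rounds.

```python
from typing import Callable, List, Sequence


def play_best_arm_game(
    losses: Sequence[Sequence[float]],
    choose: Callable[[List[int]], int],
    recommend: Callable[[List[float]], int],
    rounds: int,
) -> List[int]:
    """Play the game of Figure 2 for `rounds` steps and return every J_t.

    losses[i][k] is the loss of arm i on its (k+1)-th pull, fixed before
    the game starts (oblivious adversary).
    """
    n = len(losses)
    pulls = [0] * n                  # T_i, counted from 0 in this sketch
    latest = [float("inf")] * n      # last revealed loss of each arm
    recommendations = []
    for _ in range(rounds):
        i = choose(pulls)            # I_t: the arm that is pulled
        latest[i] = losses[i][pulls[i]]
        pulls[i] += 1
        recommendations.append(recommend(latest))  # J_t: what is evaluated
    return recommendations


# Hypothetical strategy: pull the least-pulled arm, recommend the arm
# whose most recent loss is lowest.
def round_robin(pulls: List[int]) -> int:
    return min(range(len(pulls)), key=lambda i: pulls[i])


def lowest_latest(latest: List[float]) -> int:
    return min(range(len(latest)), key=lambda i: latest[i])


toy_losses = [[0.9, 0.5, 0.2, 0.1], [0.4, 0.35, 0.3, 0.3]]
recs = play_best_arm_game(toy_losses, round_robin, lowest_latest, rounds=8)
```

In this toy run the recommendation switches to arm 0 only once its slowly improving losses drop below those of arm 1, which is the behavior the pulled arm $I_t$ versus recommended arm $J_t$ distinction is meant to capture.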

The best-arm identification problems defined below require that the
losses be generated by an oblivious adversary, which essentially means
that the loss sequences are independent of the algorithm's actions.
Contrast this with an adaptive adversary that can adapt future losses
based on all the arms that the algorithm has played up to the current
time. If the losses are chosen by an oblivious adversary then without
loss of generality we may assume that all the losses were generated
before the start of the game. See Bubeck & Cesa-Bianchi (2012) for more
details. We now compare the stochastic and the proposed non-stochastic
best-arm identification problems.
Stochastic: For all $i \in [n]$, $k \ge 1$, let $\ell_{i,k}$ be an i.i.d.
sample from a probability distribution supported on $[0, 1]$. For each
$i$, $\mathbb{E}[\ell_{i,k}]$ exists and is equal to some constant
$\mu_i$ for all $k \ge 1$. The goal is to identify $\arg\min_i \mu_i$
while minimizing $\sum_{i=1}^{n} T_i$.

Non-stochastic (proposed in this work): For all $i \in [n]$, $k \ge 1$,
let $\ell_{i,k} \in \mathbb{R}$ be generated by an oblivious adversary
and assume $\nu_i = \lim_{\tau \to \infty} \ell_{i,\tau}$ exists. The
goal is to identify $\arg\min_i \nu_i$ while minimizing
$\sum_{i=1}^{n} T_i$.


These two settings are related in that we can always turn the stochastic
setting into the non-stochastic setting by defining
$\ell_{i,T_i} = \frac{1}{T_i} \sum_{k=1}^{T_i} \ell'_{i,k}$, where
$\ell'_{i,k}$ are the losses from the stochastic problem; by the law of
large numbers $\lim_{\tau \to \infty} \ell_{i,\tau} =
\mathbb{E}[\ell'_{i,1}]$. In fact, we could do something similar with
other less-well-behaved statistics like the minimum (or maximum) of the
stochastic returns of an arm. As described in Cicirello & Smith (2005),
we can define $\ell_{i,T_i} = \min\{\ell'_{i,1}, \ell'_{i,2}, \dots,
\ell'_{i,T_i}\}$, which has a limit since $\ell_{i,t}$ is a bounded,
monotonically decreasing sequence.
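As a small illustration of these two reductions, the running mean and running minimum of a raw stochastic sequence can be computed as follows. The function names are hypothetical and the input lists stand in for the stochastic losses $\ell'_{i,k}$ of a single arm.

```python
from itertools import accumulate
from typing import List, Sequence


def running_mean(raw: Sequence[float]) -> List[float]:
    """l_{i,T} = (1/T) * sum_{k=1..T} l'_{i,k} for T = 1..len(raw)."""
    return [total / (k + 1) for k, total in enumerate(accumulate(raw))]


def running_min(raw: Sequence[float]) -> List[float]:
    """l_{i,T} = min(l'_{i,1}, ..., l'_{i,T}) for T = 1..len(raw)."""
    return list(accumulate(raw, min))


means = running_mean([1.0, 0.0, 1.0, 0.0])
mins = running_min([0.75, 1.0, 0.25, 0.5])
```

Either transformed sequence can then be fed to an algorithm for the non-stochastic setting in place of the raw samples.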


However, the generality of the non-stochastic setting introduces novel
challenges. In the stochastic setting, if we set
$\hat{\mu}_{i,T_i} = \frac{1}{T_i} \sum_{k=1}^{T_i} \ell_{i,k}$ then
$|\hat{\mu}_{i,T_i} - \mu_i| \le \sqrt{\frac{\log(4 n T_i^2)}{2 T_i}}$
for all $i \in [n]$ and $T_i > 0$ by applying Hoeffding's inequality and
a union bound. In contrast, the non-stochastic setting's assumption that
$\lim_{\tau \to \infty} \ell_{i,\tau}$ exists implies that there exists a
non-increasing function $\gamma_i$ such that
$|\ell_{i,t} - \lim_{\tau \to \infty} \ell_{i,\tau}| \le \gamma_i(t)$ and
that $\lim_{t \to \infty} \gamma_i(t) = 0$. However, the existence of
this limit tells us nothing about how quickly $\gamma_i(t)$ approaches 0.
The lack of an explicit convergence rate as a function of $t$ presents a
problem, as even the tightest $\gamma_i(t)$ could decay arbitrarily
slowly and we would never know it.


This observation has two consequences. First, we can never reject the
possibility that an arm is the "best" arm. Second, we can never verify
that an arm is the "best" arm or even attain a value within $\epsilon$ of
the best arm. Despite these challenges, in Section 3 we identify an
effective algorithm under natural measures of performance, using ideas
inspired by the fixed budget setting of the stochastic best arm problem
(Karnin et al., 2013; Audibert & Bubeck, 2010; Gabillon et al., 2012).


2.1. Related work 

Despite dating back to the late 1950's, the best-arm identification
problem for the stochastic setting has experienced a surge of activity in
the last decade. The work has two major branches: the fixed budget
setting and the fixed confidence setting. In the fixed budget setting,
the algorithm is given a set of arms and a budget $B$ and is tasked with
maximizing the probability of identifying the

Exploration algorithm              # observed losses
Uniform (baseline) (B)             $n$
Successive Halving* (B)            $2n + 1$
Successive Rejects (B)             $(n + 1)n/2$
Successive Elimination (C)         $n \log_2(2B)$
LUCB (C), lil'UCB (C), EXP3 (R)    $B$


Table 1. The number of times an algorithm observes a loss in terms of
budget $B$ and number of arms $n$, where $B$ is known to the algorithm.
(B), (C), or (R) indicate whether the algorithm is of the fixed budget,
fixed confidence, or cumulative regret variety, respectively. (*)
indicates the algorithm we propose for use in the non-stochastic best arm
setting.


best arm by pulling arms without exceeding the total budget. While these
algorithms were developed for and analyzed in the stochastic setting,
they exhibit attributes that are very amenable to the non-stochastic
setting. In fact, the algorithm we propose to use in this paper is
exactly the Successive Halving algorithm of Karnin et al. (2013), though
the non-stochastic setting requires its own novel analysis that we
present in Section 3. Successive Rejects (Audibert & Bubeck, 2010) is
another fixed budget algorithm that we compare to in our experiments.

The best-arm identification problem in the fixed confidence setting takes
an input $\delta \in (0, 1)$ and guarantees to output the best arm with
probability at least $1 - \delta$ while attempting to minimize the number
of total arm pulls. These algorithms rely on probability theory to
determine how many times each arm must be pulled in order to decide if
the arm is suboptimal and should no longer be pulled, either by
explicitly discarding it, e.g., Successive Elimination (Even-Dar et al.,
2006) and Exponential Gap Elimination (Karnin et al., 2013), or
implicitly by other methods, e.g., LUCB (Kalyanakrishnan et al., 2012)
and lil'UCB (Jamieson et al., 2014). Algorithms from the fixed confidence
setting are ill-suited for the non-stochastic best-arm identification
problem because they rely on statistical bounds that are generally not
applicable in the non-stochastic case. These algorithms also exhibit some
undesirable behavior with respect to how many losses they observe, which
we explore next.


In addition to the total number of arm pulls, this work also considers
the required number of observed losses. This is a natural cost to
consider when $\ell_{i,T_i}$ for any $i$ is the result of doing some
computation like evaluating a partially trained classifier on a hold-out
validation set or releasing a product to the market to probe for demand.
In some cases the cost, be it time, effort, or dollars, of an evaluation
of the loss of an arm after some number of pulls can dwarf the cost of
pulling the arm. Assuming a known time horizon (or budget), Table 1
describes the total number of times various algorithms observe a loss as
a function of the budget $B$ and the number of arms $n$. We include in
our comparison the EXP3 algorithm (Auer et al., 2002), a popular approach
for minimizing cumulative regret in the non-stochastic setting. In
practice $B \gg n$, and thus Successive Halving is a particularly
attractive option, as along with the baseline, it is the only algorithm
that observes losses proportional to the number of arms and independent
of the budget. As we will see in Section 5, the performance of these
algorithms is quite dependent on the number of observed losses.
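To get a feel for the gap when $B \gg n$, the entries of Table 1 can be tabulated for concrete values. The sketch below simply evaluates the expressions listed in the table; the function name and dictionary keys are illustrative.

```python
import math
from typing import Dict


def observed_losses(n: int, budget: int) -> Dict[str, float]:
    """Number of loss observations per algorithm, following Table 1."""
    return {
        "Uniform (baseline)": n,
        "Successive Halving": 2 * n + 1,
        "Successive Rejects": (n + 1) * n // 2,
        "Successive Elimination": n * math.log2(2 * budget),
        "LUCB, lil'UCB, EXP3": budget,
    }


counts = observed_losses(n=10, budget=100_000)
```

With 10 arms and a budget of 100,000 pulls, the first two rows stay at 10 and 21 observations, while the last row grows with the budget itself.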


3. Proposed algorithm and analysis 

The proposed Successive Halving algorithm of Figure 3 was originally
proposed for the stochastic best arm identification problem in the fixed
budget setting by Karnin et al. (2013). However, our novel analysis in
this work shows that it is also effective in the non-stochastic setting.
The idea behind the algorithm is simple: given an input budget, uniformly
allocate the budget to a set of arms for a predefined number of
iterations, evaluate their performance, throw out the worst half, and
repeat until just one arm remains.


Successive Halving Algorithm

input: Budget $B$, $n$ arms where $\ell_{i,k}$ denotes the $k$th loss
    from the $i$th arm
Initialize: $S_0 = [n]$.
For $k = 0, 1, \dots, \lceil \log_2(n) \rceil - 1$
    Pull each arm in $S_k$ for
    $r_k = \left\lfloor \frac{B}{|S_k| \lceil \log_2(n) \rceil} \right\rfloor$
    additional times and set $R_k = \sum_{j=0}^{k} r_j$.
    Let $\sigma_k$ be a bijection on $S_k$ such that
    $\ell_{\sigma_k(1),R_k} \le \ell_{\sigma_k(2),R_k} \le \dots \le
    \ell_{\sigma_k(|S_k|),R_k}$
    $S_{k+1} = \left\{ i \in S_k : \ell_{\sigma_k(i),R_k} \le
    \ell_{\sigma_k(\lfloor |S_k|/2 \rfloor),R_k} \right\}$.
output: Singleton element of $S_{\lceil \log_2(n) \rceil}$


Figure 3. Successive Halving was originally proposed for the stochastic
best arm identification problem in Karnin et al. (2013) but is also
applicable to the non-stochastic setting.
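A minimal Python sketch of Figure 3 follows. It is an illustration rather than the authors' implementation: the callback `loss(i, t)`, assumed to return $\ell_{i,t}$, and the toy arms are hypothetical, and the `max(1, ...)` guard for a singleton survivor set is an added assumption so that values of $n$ that are not powers of two still end with one arm.

```python
import math
from typing import Callable


def successive_halving(n: int, budget: int, loss: Callable[[int, int], float]) -> int:
    """Return the arm left after ceil(log2(n)) halving rounds.

    loss(i, t) is assumed to return l_{i,t}, the loss of arm i after t pulls.
    """
    rounds = max(1, math.ceil(math.log2(n)))
    survivors = list(range(n))                      # S_0 = [n]
    total_pulls = 0                                 # R_k
    for _ in range(rounds):
        r_k = budget // (len(survivors) * rounds)   # floor(B / (|S_k| ceil(log2 n)))
        total_pulls += r_k
        if total_pulls == 0:
            raise ValueError("budget too small to pull each arm once")
        # Rank surviving arms by their loss after R_k pulls (the bijection sigma_k).
        survivors.sort(key=lambda i: loss(i, total_pulls))
        # Keep the better half; max(1, ...) guards the case |S_k| = 1.
        survivors = survivors[: max(1, len(survivors) // 2)]
    return survivors[0]


# Toy arms: arm 0 has the best limit (0.1) but converges slowly.
limits = [0.1, 0.2, 0.3, 0.4]


def toy_loss(i: int, t: int) -> float:
    return limits[i] + (1.0 / t if i == 0 else 0.0)


small_budget_choice = successive_halving(4, 16, toy_loss)
large_budget_choice = successive_halving(4, 160, toy_loss)
```

With the small budget the slowly converging best arm is discarded in the first round, while the larger budget gives it enough pulls to survive, which is the dependence on the budget that the analysis below quantifies.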


The budget as an input is easily removed by the "doubling trick" that
attempts $B \leftarrow n$, then $B \leftarrow 2B$, and so on. This method
can reuse existing progress from iteration to iteration and effectively
makes the algorithm parameter free. But its most notable quality is that
if a budget of $B'$ is necessary to succeed in finding the best arm, by
performing the doubling trick one will have only had to use a budget of
$2B'$ in the worst case without ever having to know $B'$ in the first
place. Thus, for the remainder of this section we consider a fixed
budget.
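The doubling trick itself is a short loop. In the sketch below, `run` is a hypothetical callback that executes any fixed-budget procedure and returns its recommended arm; for brevity each attempt restarts from scratch instead of reusing earlier progress, and the toy procedure is a uniform allocation over four made-up arms.

```python
from typing import Callable, Iterator, List, Tuple


def doubling_trick(n: int, run: Callable[[int], int], attempts: int) -> Iterator[Tuple[int, int]]:
    """Yield (budget, recommendation) for B = n, 2n, 4n, ..."""
    budget = n
    for _ in range(attempts):
        yield budget, run(budget)
        budget *= 2


# Toy fixed-budget procedure: uniform allocation over four arms whose best
# arm (index 0, limit 0.1) converges slowly.
limits: List[float] = [0.1, 0.2, 0.3, 0.4]


def uniform_run(budget: int) -> int:
    t = budget // len(limits)
    return min(range(len(limits)), key=lambda i: limits[i] + (1.0 / t if i == 0 else 0.0))


history = list(doubling_trick(len(limits), uniform_run, attempts=5))
total_spent = sum(budget for budget, _ in history)
```

In this run the first successful budget is 64, and the total spent across all attempts is 124, which stays below twice that budget.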


3.1. Analysis of Successive Halving 

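We first show that the algorithm never takes a total number of samples
that exceeds the budget $B$:

$$\sum_{k=0}^{\lceil \log_2(n) \rceil - 1} |S_k| \left\lfloor
\frac{B}{|S_k| \lceil \log_2(n) \rceil} \right\rfloor \le
\sum_{k=0}^{\lceil \log_2(n) \rceil - 1} \frac{B}{\lceil \log_2(n) \rceil}
\le B.$$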


Next we consider how the algorithm performs in terms of identifying the
best arm. First, for $i = 1, \dots, n$ define
$\nu_i = \lim_{\tau \to \infty} \ell_{i,\tau}$, which exists by
assumption. Without loss of generality, assume that

$$\nu_1 < \nu_2 \le \dots \le \nu_n.$$

We next introduce functions that bound the approximation error of
$\ell_{i,t}$ with respect to $\nu_i$ as a function of $t$. For each
$i = 1, 2, \dots, n$ let $\gamma_i(t)$ be the point-wise smallest,
non-increasing function of $t$ such that

$$|\ell_{i,t} - \nu_i| \le \gamma_i(t) \quad \forall t.$$

In addition, define
$\gamma_i^{-1}(\alpha) = \min\{t \in \mathbb{N} : \gamma_i(t) \le \alpha\}$
for all $i \in [n]$. With this definition, if
$t_i > \gamma_i^{-1}\left(\frac{\nu_i - \nu_1}{2}\right)$ and
$t_1 > \gamma_1^{-1}\left(\frac{\nu_i - \nu_1}{2}\right)$ then

$$\ell_{i,t_i} - \ell_{1,t_1} = (\ell_{i,t_i} - \nu_i) + (\nu_1 -
\ell_{1,t_1}) + 2\left(\frac{\nu_i - \nu_1}{2}\right)$$
$$\ge -\gamma_i(t_i) - \gamma_1(t_1) + 2\left(\frac{\nu_i -
\nu_1}{2}\right) > 0.$$

Indeed, if $\min\{t_i, t_1\} > \max\left\{\gamma_i^{-1}\left(\frac{\nu_i -
\nu_1}{2}\right), \gamma_1^{-1}\left(\frac{\nu_i - \nu_1}{2}\right)\right\}$
then we are guaranteed to have that $\ell_{i,t_i} > \ell_{1,t_1}$. That
is, comparing the intermediate values at $t_i$ and $t_1$ suffices to
determine the ordering of the final values $\nu_i$ and $\nu_1$.
Intuitively, this condition holds because the envelopes at the given
times, namely $\gamma_i(t_i)$ and $\gamma_1(t_1)$, are small relative to
the gap between $\nu_i$ and $\nu_1$. This line of reasoning is at the
heart of the proof of our main result, and the theorem is stated in terms
of these quantities. All proofs can be found in the appendix.
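For intuition, the envelope $\gamma_i$ and its inverse can be computed on a finite loss sequence whose limit is known. The sketch below makes the loud assumption that the sequence has already settled by its last entry, so that a suffix maximum over the observed horizon equals the true envelope; in general the tail is unknown and this is exactly the quantity we cannot compute. Function names are hypothetical.

```python
from typing import List, Optional, Sequence


def envelope(losses: Sequence[float], limit: float) -> List[float]:
    """Smallest non-increasing gamma with |l_t - limit| <= gamma(t), t = 1..len(losses).

    Computed as a suffix maximum; assumes the sequence does not deviate
    from `limit` beyond the observed horizon.
    """
    gamma: List[float] = []
    worst = 0.0
    for value in reversed(losses):
        worst = max(worst, abs(value - limit))
        gamma.append(worst)
    return gamma[::-1]


def envelope_inverse(gamma: Sequence[float], alpha: float) -> Optional[int]:
    """min{t : gamma(t) <= alpha} with t counted from 1, or None if never reached."""
    for t, value in enumerate(gamma, start=1):
        if value <= alpha:
            return t
    return None


# A non-monotone sequence converging to 0: the dip at t = 2 does not
# lower the envelope because the loss rises again at t = 3.
gamma = envelope([1.0, 0.25, 0.5, 0.0, 0.0], limit=0.0)
```

The example shows why the envelope, not the raw loss, governs when two arms can be safely compared.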

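Theorem 1 Let $\nu_i = \lim_{\tau \to \infty} \ell_{i,\tau}$,
$\bar{\gamma}(t) = \max_{i=1,\dots,n} \gamma_i(t)$ and

$$z = 2 \lceil \log_2(n) \rceil \max_{i=2,\dots,n} i \left(1 +
\bar{\gamma}^{-1}\left(\frac{\nu_i - \nu_1}{2}\right)\right)$$
$$\le 2 \lceil \log_2(n) \rceil \left(n + \sum_{i=2,\dots,n}
\bar{\gamma}^{-1}\left(\frac{\nu_i - \nu_1}{2}\right)\right).$$

If the budget $B > z$ then the best arm is returned from the algorithm.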

The representation of $z$ on the right-hand-side of the inequality is
very intuitive: if $\bar{\gamma}(t) = \gamma_i(t)$ $\forall i$ and an
oracle gave us an explicit form for $\bar{\gamma}(t)$, then to merely
verify that the $i$th arm's final value is higher than the best arm's,
one must pull each of the two arms at least a number of times equal to
the $i$th term in the sum (this becomes clear by inspecting the proof of
Theorem 3). Repeating this argument for all $i = 2, \dots, n$ explains
the sum over all $n - 1$ arms. While clearly not a proof, this argument
along with known lower bounds for the stochastic setting (Audibert &
Bubeck, 2010; Kaufmann et al., 2014), a subset of the non-stochastic
setting, suggest that the above result may be nearly tight in a minimax
sense up to log factors.
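Both expressions in Theorem 1 are easy to evaluate once an inverse envelope is given. The sketch below assumes, purely for illustration, an oracle envelope $\bar{\gamma}(t) = 1/t$, so that $\bar{\gamma}^{-1}(\alpha) = \lceil 1/\alpha \rceil$, and made-up limiting values; the function name is hypothetical.

```python
import math
from typing import Callable, Sequence, Tuple


def sh_budget(nu: Sequence[float], gamma_inv: Callable[[float], int]) -> Tuple[int, int]:
    """Return (z, looser bound on z) from Theorem 1; nu must be sorted ascending."""
    n = len(nu)
    log_term = math.ceil(math.log2(n))
    # inv[j] is gamma_inv((nu_i - nu_1) / 2) for arm i = j + 2.
    inv = [gamma_inv((nu[i] - nu[0]) / 2) for i in range(1, n)]
    z = 2 * log_term * max((j + 2) * (1 + g) for j, g in enumerate(inv))
    looser = 2 * log_term * (n + sum(inv))
    return z, looser


# Assumed oracle envelope gamma(t) = 1/t and illustrative limits.
z, looser = sh_budget([0.0, 0.25, 0.5, 0.75], lambda alpha: math.ceil(1.0 / alpha))
```

Here the arm closest to the best one dominates the maximum, while the looser bound adds up the verification cost of every suboptimal arm.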


Example 1 Consider a feature-selection problem where you are given a
dataset $\{(x_i, y_i)\}_{i=1}^{n}$ where each $x_i \in \mathbb{R}^D$ and
you are tasked with identifying the best subset of features of size $d$
that linearly predicts $y_i$ in terms of the least-squares metric. In our
framework, each $d$-subset is an arm and there are $n = \binom{D}{d}$
arms. Least squares is a convex quadratic optimization problem that can
be efficiently solved with stochastic gradient descent. Using known
bounds for the rates of convergence (Nemirovski et al., 2009) one can
show that $\gamma_a(t) \le \frac{\sigma_a \log(nt/\delta)}{t}$ for all
$a = 1, \dots, n$ arms and all $t \ge 1$ with probability at least
$1 - \delta$, where $\sigma_a$ is a constant that depends on the
condition number of the quadratic defined by the $d$-subset. Then in
Theorem 1, $\bar{\gamma}(t) = \frac{\sigma_{\max} \log(nt/\delta)}{t}$
with $\sigma_{\max} = \max_{a=1,\dots,n} \sigma_a$, so after inverting
$\bar{\gamma}$ we find that

$$z = 2 \lceil \log_2(n) \rceil \max_{a=2,\dots,n} a \,
\frac{4 \sigma_{\max} \log\left(\frac{2 n \sigma_{\max}}{\delta (\nu_a -
\nu_1)}\right)}{\nu_a - \nu_1}$$

is a sufficient budget to identify the best arm. Later we put this result
in context by comparing to a baseline strategy.


In the above example we computed upper bounds on the $\gamma_i$ functions
in terms of problem dependent parameters to provide us with a sample
complexity by plugging these values into our theorem. However, we stress
that constructing tight bounds for the $\gamma_i$ functions is very
difficult outside of very simple problems like the one described above,
and even then we have unspecified constants. Fortunately, because our
algorithm is agnostic to these $\gamma_i$ functions, it is also in some
sense adaptive to them: the faster the arms' losses converge, the faster
the best arm is discovered, without ever changing the algorithm. This
behavior is in stark contrast to the hyperparameter tuning work of
Swersky et al. (2014) and Agarwal et al. (2012), in which the algorithms
explicitly take upper bounds on these $\gamma_i$ functions as input,
meaning the performance of the algorithm is only as good as the tightness
of these difficult to calculate bounds.


3.2. Comparison to a uniform allocation strategy 

We can also derive a result for the naive uniform budget allocation
strategy. For simplicity, let $B$ be a multiple of $n$ so that at the end
of the budget we have $T_i = B/n$ for all $i \in [n]$ and the output arm
is equal to $\hat{i} = \arg\min_i \ell_{i,B/n}$.


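Theorem 2 (Uniform strategy - sufficiency) Let
$\nu_i = \lim_{\tau \to \infty} \ell_{i,\tau}$,
$\bar{\gamma}(t) = \max_{i=1,\dots,n} \gamma_i(t)$ and

$$z = \max_{i=2,\dots,n} n \, \bar{\gamma}^{-1}\left(\frac{\nu_i -
\nu_1}{2}\right).$$

If $B > z$ then the uniform strategy returns the best arm.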

Theorem 2 is just a sufficiency statement so it is unclear how the
performance of the method actually compares to the Successive Halving
result of Theorem 1. The next theorem says that the above result is tight
in a worst-case sense, exposing the real gap between the algorithm of
Figure 3 and the naive uniform allocation strategy.

Theorem 3 (Uniform strategy - necessity) For any given budget $B$ and
final values $\nu_1 < \nu_2 \le \dots \le \nu_n$ there exists a sequence
of losses $\{\ell_{i,t}\}_{t=1}^{\infty}$, $i = 1, 2, \dots, n$ such that
if

$$B < \max_{i=2,\dots,n} n \, \bar{\gamma}^{-1}\left(\frac{\nu_i -
\nu_1}{2}\right)$$

then the uniform budget allocation strategy will not return the best arm.

If we consider the second, looser representation of $z$ on the
right-hand-side of the inequality in Theorem 1 and multiply this quantity
by $\frac{n-1}{n-1}$ we see that the sufficient number of pulls for the
Successive Halving algorithm essentially behaves like
$(n - 1) \log_2(n)$ times the average
$\frac{1}{n-1} \sum_{i=2,\dots,n} \bar{\gamma}^{-1}\left(\frac{\nu_i -
\nu_1}{2}\right)$ whereas the necessary result of the uniform allocation
strategy of Theorem 3 behaves like $n$ times the maximum
$\max_{i=2,\dots,n} \bar{\gamma}^{-1}\left(\frac{\nu_i -
\nu_1}{2}\right)$. The next example shows that the difference between
this average and max can be very significant.
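The gap between the two budgets can be checked numerically. The sketch below again assumes an oracle envelope $\bar{\gamma}(t) = 1/t$ and evenly spaced limits $\nu_a = a/n$ with $n = 64$; both choices and all function names are illustrative.

```python
import math
from typing import Callable, Sequence


def sh_sufficient(nu: Sequence[float], gamma_inv: Callable[[float], int]) -> int:
    """z of Theorem 1: 2 ceil(log2 n) max_i i (1 + gamma_inv((nu_i - nu_1)/2))."""
    n = len(nu)
    return 2 * math.ceil(math.log2(n)) * max(
        (i + 1) * (1 + gamma_inv((nu[i] - nu[0]) / 2)) for i in range(1, n)
    )


def uniform_necessary(nu: Sequence[float], gamma_inv: Callable[[float], int]) -> int:
    """Threshold of Theorems 2 and 3: n max_i gamma_inv((nu_i - nu_1)/2)."""
    n = len(nu)
    return n * max(gamma_inv((nu[i] - nu[0]) / 2) for i in range(1, n))


def inverse_of_one_over_t(alpha: float) -> int:
    return math.ceil(1.0 / alpha)   # assumed oracle envelope gamma(t) = 1/t


n = 64
nu = [a / n for a in range(1, n + 1)]
sh_z = sh_sufficient(nu, inverse_of_one_over_t)
uniform_z = uniform_necessary(nu, inverse_of_one_over_t)
```

For this instance Successive Halving is guaranteed to succeed with a budget well below the one at which uniform allocation can still be made to fail.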

Example 2 Recall Example 1 and now assume that $\sigma_a = \sigma_{\max}$
for all $a = 1, \dots, n$. Then Theorem 3 says that the uniform
allocation budget must be at least

$$n \, \frac{4 \sigma_{\max} \log\left(\frac{2 n \sigma_{\max}}{\delta
(\nu_2 - \nu_1)}\right)}{\nu_2 - \nu_1}$$

to identify the best arm. To see how this result compares with that of
Successive Halving, let us parameterize the $\nu_a$ limiting values such
that $\nu_a = a/n$ for $a = 1, \dots, n$. Then a sufficient budget for
the Successive Halving algorithm to identify the best arm is just
$8 n \lceil \log_2(n) \rceil \sigma_{\max} \log\left(\frac{n^2
\sigma_{\max}}{\delta}\right)$ while the uniform allocation strategy
would require a budget of at least
$2 n^2 \sigma_{\max} \log\left(\frac{n^2 \sigma_{\max}}{\delta}\right)$.
This is a difference of essentially $4 n \log_2(n)$ versus $n^2$.

3.3. A pretty good arm 

Up to this point we have been concerned with identifying the best arm,
whose final value is $\nu_1 = \min_i \nu_i$, where we recall that
$\nu_i = \lim_{\tau \to \infty} \ell_{i,\tau}$. But in practice one may
be satisfied with merely an $\epsilon$-good arm $i_\epsilon$ in the sense
that $\nu_{i_\epsilon} - \nu_1 \le \epsilon$. However, with our minimal
assumptions, such a statement is impossible to make since we have no
knowledge of the $\gamma_i$ functions to determine that an arm's final
value is within $\epsilon$ of any value, much less the unknown final
converged value of the best arm. However, as we show in Theorem 4, the
Successive Halving algorithm cannot do much worse than the uniform
allocation strategy.


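3. Choose the hyperparameters that minimize the empirical loss on the
examples in VAL: $\hat{\theta} = \arg\min_{\theta \in \Theta}
\frac{1}{|\mathrm{VAL}|} \sum_{i \in \mathrm{VAL}}
\mathrm{loss}(f_\theta(x_i), y_i)$

4. Report the empirical loss of $\hat{\theta}$ on the test set:
$\frac{1}{|\mathrm{TEST}|} \sum_{i \in \mathrm{TEST}}
\mathrm{loss}(f_{\hat{\theta}}(x_i), y_i)$.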


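Theorem 4 For a budget $B$ and set of $n$ arms, define $\hat{i}_{SH}$ as
the output of the Successive Halving algorithm. Then

$$\nu_{\hat{i}_{SH}} - \nu_1 \le \lceil \log_2(n) \rceil \, 2
\bar{\gamma}\left(\left\lfloor \frac{B}{n \lceil \log_2(n) \rceil}
\right\rfloor\right).$$

Moreover, $\hat{i}_U$, the output of the uniform strategy, satisfies

$$\nu_{\hat{i}_U} - \nu_1 \le \ell_{\hat{i}_U,B/n} - \ell_{1,B/n} + 2
\bar{\gamma}(B/n) \le 2 \bar{\gamma}(B/n).$$

Example 3 Recall Example 1. Both the Successive Halving algorithm and the
uniform allocation strategy satisfy
$\nu_{\hat{i}} - \nu_1 \le \tilde{O}(n/B)$ where $\hat{i}$ is the output
of either algorithm and $\tilde{O}$ suppresses poly log factors.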

We stress that this result is merely a fall-back guarantee, ensuring that
we can never do much worse than uniform. However, it does not rule out
the possibility of the Successive Halving algorithm far outperforming the
uniform allocation strategy in practice. Indeed, we observe
order-of-magnitude speedups in our experimental results.


4. Hyperparameter optimization for 
supervised learning 

In supervised learning we are given a dataset that is composed of pairs
$(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$ for $i = 1, \dots, n$
sampled i.i.d. from some unknown joint distribution $P_{X,Y}$, and we are
tasked with finding a map (or model) $f : \mathcal{X} \to \mathcal{Y}$
that minimizes $\mathbb{E}_{(X,Y) \sim P_{X,Y}}[\mathrm{loss}(f(X), Y)]$
for some known loss function
$\mathrm{loss} : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$. Since
$P_{X,Y}$ is unknown, we cannot compute
$\mathbb{E}_{(X,Y) \sim P_{X,Y}}[\mathrm{loss}(f(X), Y)]$ directly, but
given $m$ additional samples drawn i.i.d. from $P_{X,Y}$ we can
approximate it with an empirical estimate, that is,
$\frac{1}{m} \sum_{i=1}^{m} \mathrm{loss}(f(x_i), y_i)$. We do not
consider arbitrary mappings $\mathcal{X} \to \mathcal{Y}$ but only those
that are the output of running a fixed, possibly randomized, algorithm
$\mathcal{A}$ that takes a dataset $\{(x_i, y_i)\}_{i=1}^{n}$ and
algorithm-specific parameters $\theta \in \Theta$ as input so that for
any $\theta$ we have
$f_\theta = \mathcal{A}(\{(x_i, y_i)\}_{i=1}^{n}, \theta)$ where
$f_\theta : \mathcal{X} \to \mathcal{Y}$. For a fixed dataset
$\{(x_i, y_i)\}_{i=1}^{n}$ the parameters $\theta \in \Theta$ index the
different functions $f_\theta$, and will henceforth be referred to as
hyperparameters. We adopt the train-validate-test framework for choosing
hyperparameters (Hastie et al., 2005):


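1. Partition the total dataset into TRAIN, VAL, and TEST sets with
$\mathrm{TRAIN} \cup \mathrm{VAL} \cup \mathrm{TEST} = \{(x_i,
y_i)\}_{i=1}^{m}$.

2. Use TRAIN to train a model
$f_\theta = \mathcal{A}(\{(x_i, y_i)\}_{i \in \mathrm{TRAIN}}, \theta)$
for each $\theta \in \Theta$,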

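Example 4 Consider a linear classification example where
$\mathcal{X} \times \mathcal{Y} = \mathbb{R}^d \times \{-1, 1\}$,
$\Theta \subset \mathbb{R}_+$,
$f_\theta = \mathcal{A}(\{(x_i, y_i)\}_{i \in \mathrm{TRAIN}}, \theta)$
where $f_\theta(x) = \langle w_\theta, x \rangle$ with
$w_\theta = \arg\min_w \frac{1}{|\mathrm{TRAIN}|} \sum_{i \in
\mathrm{TRAIN}} \max(0, 1 - y_i \langle w, x_i \rangle) + \theta
\|w\|_2^2$, and finally
$\hat{\theta} = \arg\min_{\theta \in \Theta} \frac{1}{|\mathrm{VAL}|}
\sum_{i \in \mathrm{VAL}} \mathbf{1}\{y_i f_\theta(x_i) < 0\}$.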


In the simple example above involving a single hyperparameter, we
emphasize that for each $\theta$ we have that $f_\theta$ can be
efficiently computed using an iterative algorithm (Shalev-Shwartz et al.,
2011). However, the selection of $\hat{f}$ is the minimization of a
function that is not necessarily even continuous, much less convex. This
pattern is more often the rule than the exception. We next attempt to
generalize and exploit this observation.
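The train-validate-test recipe applied to Example 4 might be sketched as follows. Everything concrete here is an assumption made for illustration: the synthetic two-dimensional data, the grid of $\theta$ values, the plain subgradient descent standing in for the algorithm $\mathcal{A}$, and its step size and iteration count.

```python
import random
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], int]   # (x_i, y_i) with y_i in {-1, +1}


def dot(w: Sequence[float], x: Sequence[float]) -> float:
    return sum(wj * xj for wj, xj in zip(w, x))


def train_hinge(train: List[Example], theta: float, steps: int = 200, lr: float = 0.1) -> List[float]:
    """Subgradient descent on mean hinge loss + theta * ||w||^2."""
    w = [0.0] * len(train[0][0])
    for _ in range(steps):
        grad = [2.0 * theta * wj for wj in w]
        for x, y in train:
            if y * dot(w, x) < 1.0:               # hinge term is active
                for j, xj in enumerate(x):
                    grad[j] -= y * xj / len(train)
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def zero_one_loss(w: Sequence[float], data: List[Example]) -> float:
    """Fraction of examples with y_i * f(x_i) < 0."""
    return sum(1 for x, y in data if y * dot(w, x) < 0) / len(data)


rng = random.Random(0)
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(300)]
data = [((a, b), 1 if a + 0.5 * b > 0 else -1) for a, b in points]
train, val, test = data[:200], data[200:250], data[250:]                   # step 1

thetas = [1e-3, 1e-2, 1e-1, 1.0]
models = {theta: train_hinge(train, theta) for theta in thetas}            # step 2
theta_hat = min(thetas, key=lambda th: zero_one_loss(models[th], val))     # step 3
test_loss = zero_one_loss(models[theta_hat], test)                         # step 4
```

Note that every model is trained to the same fixed number of steps before it is compared, which is exactly the black-box treatment that the bandit formulation below relaxes.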


4.1. Posing as a best arm non-stochastic bandits 
problem 

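Let us assume that the algorithm $\mathcal{A}$ is iterative so that for a
given $\{(x_i, y_i)\}_{i \in \mathrm{TRAIN}}$ and $\theta$, the algorithm
outputs a function $f_{\theta,t}$ every iteration $t \ge 1$ and we may
compute

$$\ell_{\theta,t} = \frac{1}{|\mathrm{VAL}|} \sum_{i \in \mathrm{VAL}}
\mathrm{loss}(f_{\theta,t}(x_i), y_i).$$

We assume that the limit $\lim_{t \to \infty} \ell_{\theta,t}$ exists*
and is equal to $\frac{1}{|\mathrm{VAL}|} \sum_{i \in \mathrm{VAL}}
\mathrm{loss}(f_\theta(x_i), y_i)$.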
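An arm in this formulation is simply an iterative trainer that can report $\ell_{\theta,t}$ on demand. The sketch below uses a deliberately tiny stand-in, a scalar ridge-penalized least-squares model trained by gradient descent; the model, data, and step size are assumptions chosen so the numbers are easy to follow.

```python
from typing import Iterator, List, Tuple

Pair = Tuple[float, float]   # (x_i, y_i) for a scalar regression problem


def validation_losses(
    train: List[Pair], val: List[Pair], theta: float, lr: float, iterations: int
) -> Iterator[float]:
    """Yield l_{theta,t} for t = 1..iterations.

    The model is f(x) = w * x, trained by gradient descent on the mean
    squared error over `train` plus theta * w^2.
    """
    w = 0.0
    for _ in range(iterations):
        grad = sum(2.0 * (w * x - y) * x for x, y in train) / len(train)
        grad += 2.0 * theta * w
        w -= lr * grad                                        # one "pull" of the arm
        yield sum((w * x - y) ** 2 for x, y in val) / len(val)


train = [(1.0, 2.0), (2.0, 4.0)]
val = [(3.0, 6.0)]
curve = list(validation_losses(train, val, theta=0.0, lr=0.125, iterations=3))
```

Because the generator only evaluates the validation loss when asked, a bandit algorithm can pull the arm many times and observe the loss rarely, which matters when evaluation is the expensive step.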

With this transformation we are in the position to put the hyperparameter
optimization problem into the framework of Figure 2 and, namely, the
non-stochastic best-arm identification formulation developed in the above
sections. We generate the arms (different hyperparameter settings)
uniformly at random (possibly on a log scale) from within the region of
valid hyperparameters (i.e., all hyperparameters within some minimum and
maximum ranges) and sample enough arms to ensure a sufficient cover of
the space (Bergstra & Bengio, 2012). Alternatively, one could input a
uniform grid over the parameters of interest. We note that random search
and grid search remain the default choices for many open source machine
learning packages
*We note that $f_\theta = \lim_{t \to \infty} f_{\theta,t}$ is not enough
to conclude that $\lim_{t \to \infty} \ell_{\theta,t}$ exists (for
instance, for classification with 0/1 loss this is not necessarily true)
but these technical issues can usually be circumvented for real datasets
and losses (for instance, by replacing $\mathbf{1}\{z < 0\}$ with a very
steep sigmoid). We ignore this technicality in our experiments.
such as LibSVM (Chang & Lin, 2011), scikit-learn (Pedregosa et al., 2011)
and MLlib (Kraska et al., 2013). As described in Figure 2, the bandit
algorithm will choose $I_t$, and we will use the convention that
$J_t = \arg\min_\theta \ell_{\theta,T_\theta}$. The arm selected by $J_t$
will be evaluated on the test set following the work-flow introduced
above.
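Generating the arms by sampling uniformly on a log scale takes only a few lines. The function name, the seed, and the example range are illustrative assumptions.

```python
import math
import random
from typing import List


def sample_log_uniform(low: float, high: float, count: int, seed: int = 0) -> List[float]:
    """Draw `count` values whose logarithm is uniform on [log(low), log(high)]."""
    rng = random.Random(seed)
    return [
        math.exp(rng.uniform(math.log(low), math.log(high)))
        for _ in range(count)
    ]


arms = sample_log_uniform(1e-6, 1.0, count=10)
```

Each sampled value then plays the role of one fixed hyperparameter setting, i.e., one arm.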


4.2. Related work 


plemented in Python and run in parallel using the multiprocessing library
on an Amazon EC2 c3.8xlarge instance with 32 cores and 60 GB of memory.
In all cases, full datasets were partitioned into a training-base dataset
and a test (TEST) dataset with a 90/10 split. The training-base dataset
was then partitioned into training (TRAIN) and validation (VAL) datasets
with an 80/20 split. All plots report loss on the test set.


We aim to leverage the iterative nature of standard machine learning
algorithms to speed up hyperparameter optimization in a robust and
principled fashion. We now review related work in the context of our
results. In Section 3.3 we show that no algorithm can provably identify a
hyperparameter with a value within $\epsilon$ of the optimal without
known, explicit functions $\gamma_i$, which means no algorithm can reject
a hyperparameter setting with absolute confidence without making
potentially unrealistic assumptions. Swersky et al. (2014) explicitly
defines the $\gamma_i$ functions in an ad-hoc, algorithm-specific, and
data-specific fashion which leads to strong $\epsilon$-good claims. A
related line of work explicitly defines $\gamma_i$-like functions for
optimizing the computational efficiency of structural risk minimization,
yielding bounds (Agarwal et al., 2012). We stress that these results are
only as good as the tightness and correctness of the $\gamma_i$ bounds,
and we view our work as an empirical, data-driven approach to the
pursuits of Agarwal et al. (2012). Also, Sparks et al. (2015) empirically
studies an early stopping heuristic for hyperparameter optimization
similar in spirit to the Successive Halving algorithm.

We further note that we fix the hyperparameter settings (or arms) under
consideration and adaptively allocate our budget to each arm. In
contrast, Bayesian optimization advocates choosing hyperparameter
settings adaptively, but with the exception of Swersky et al. (2014),
allocates a fixed budget to each selected hyperparameter setting (Snoek
et al., 2012; 2014; Hutter et al., 2011; Bergstra et al., 2011; Bergstra
& Bengio, 2012). These Bayesian optimization methods, though heuristic in
nature as they attempt to simultaneously fit and optimize a non-convex
and potentially high-dimensional function, yield promising empirical
results. We view our approach as complementary and orthogonal to the
method used for choosing hyperparameter settings, and extending our
approach in a principled fashion to adaptively choose arms, e.g., in a
mini-batch setting, is an interesting avenue for future work.


To evaluate the different search algorithms' performance,
we fix a total budget of iterations and allow the search
algorithms to decide how to divide it up amongst the different
arms. The curves are produced by implementing the
doubling trick by simply doubling the measurement budget
each time. For the purpose of interpretability, we reset all
iteration counters to 0 at each doubling of the budget, i.e.,
we do not warm start upon doubling. All datasets, aside
from the collaborative filtering experiments, are normalized
so that each dimension has mean 0 and variance 1.
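The doubling trick with no warm start can be sketched as below. The names `doubling_budgets`, `run_with_doubling`, `initial_budget`, and `max_total` are hypothetical, and the stopping rule (stop once the cumulative spend would exceed `max_total`) is an assumption, since the text does not state one.

```python
def doubling_budgets(initial_budget, max_total):
    """Yield budgets B, 2B, 4B, ... while the cumulative spend fits in max_total."""
    spent, budget = 0, initial_budget
    while spent + budget <= max_total:
        yield budget
        spent += budget
        budget *= 2

def run_with_doubling(search, initial_budget, max_total):
    """Call search(budget) from scratch at each doubling (no warm start)."""
    return [(b, search(b)) for b in doubling_budgets(initial_budget, max_total)]

print(list(doubling_budgets(100, 1000)))
```

Here `search` stands for any budgeted search routine, such as uniform allocation or Successive Halving, that restarts all iteration counters when called.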


5.1. Ridge regression 


We first consider a ridge regression problem trained with
stochastic gradient descent on this objective function with
step size $0.01/\sqrt{2+T_\lambda}$. The $\ell_2$ penalty hyperparameter
$\lambda \in [10^{-6}, 10^{0}]$ was chosen uniformly at random on a
log scale per trial, with 10 values (i.e., arms) selected per
trial. We use the Million Song Dataset year prediction task
(Lichman, 2013), where we have downsampled the dataset
by a factor of 10 and normalized the years such that they
have mean zero and variance 1 with respect to the training
set. The experiment was repeated for 32 trials. Error on the
val and test sets was calculated using mean-squared error. In
the left panel of Figure 4 we note that LUCB and lil'UCB
perform the best in the sense that they achieve a small test
error two to four times faster, in terms of iterations, than most
other methods. However, in the right panel the same data
is plotted with respect to wall-clock time rather than
iterations, and we now observe that Successive Halving and
Successive Rejects are the top performers. This is explained
by Table 1: EXP3, lil'UCB, and LUCB must evaluate
the validation loss on every iteration, requiring much greater
compute time. This pattern is observed in all experiments,
so in the sequel we only consider the uniform allocation,
Successive Halving, and Successive Rejects algorithms.
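A rough sketch of the ridge regression setup follows: drawing the 10 penalty values on a log scale, and one stochastic gradient step with the decaying step size. The objective inside `sgd_ridge_step` (squared error plus an $\ell_2$ penalty, each scaled by 1/2) is an assumption, as is reading the step-size counter as the number of iterations the arm has run so far. Function names are hypothetical.

```python
import math
import random

def sample_penalties(rng, n_arms=10):
    """Draw n_arms values of lambda log-uniformly from [1e-6, 1]."""
    return [10.0 ** rng.uniform(-6.0, 0.0) for _ in range(n_arms)]

def sgd_ridge_step(w, x, y, lam, t):
    """One SGD step on 0.5*(w.x - y)**2 + 0.5*lam*||w||**2 (assumed objective)."""
    step = 0.01 / math.sqrt(2 + t)   # t: iterations this arm has run so far (assumed)
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [wi - step * (err * xi + lam * wi) for wi, xi in zip(w, x)]

lams = sample_penalties(random.Random(0))
w = sgd_ridge_step([0.0, 0.0], [1.0, 2.0], 1.0, lam=0.1, t=2)
```

Each sampled value of lambda defines one arm, and pulling an arm corresponds to running further SGD steps for that value.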


5.2. Kernel SVM 


5. Experiment results 

In this section we compare the proposed algorithm to a
number of other algorithms, including the baseline uniform
allocation strategy, on a number of supervised learning
hyperparameter optimization problems using the experimental
setup outlined in Section 4.1. Each experiment was im¬


We now consider learning a kernel SVM using the RBF
kernel $k_\gamma(x, z) = e^{-\gamma \|x - z\|_2^2}$. The SVM is trained
using Pegasos (Shalev-Shwartz et al., 2011) with $\ell_2$ penalty
hyperparameter $\lambda \in [10^{-6}, 10^{0}]$ and kernel width
$\gamma \in [10^{0}, 10^{3}]$, both chosen uniformly at random on a log scale
per trial. Each hyperparameter was allocated 10 samples,
resulting in $10^2 = 100$ total arms. The experiment was
Figure 4. Ridge Regression. Test error with respect to both the number of iterations (left) and wall-clock time (right). Note that in the
left plot, uniform, EXP3, and Successive Elimination are plotted on top of each other.


repeated for 64 trials. Error on the val and test sets was
calculated using 0/1 loss. Kernel evaluations were computed
online (i.e., not precomputed and stored). We observe in
Figure 5 that Successive Halving obtains the same low error
more than an order of magnitude faster than both uniform
and Successive Rejects with respect to wall-clock time,
despite Successive Halving and Successive Rejects performing
comparably in terms of iterations (not plotted).
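As an illustration of the kernel SVM setup, the sketch below evaluates the RBF kernel online and builds the 100 arms as the Cartesian product of 10 sampled penalties and 10 sampled kernel widths. The function names are hypothetical, and the Pegasos training loop itself is omitted.

```python
import itertools
import math
import random

def rbf_kernel(x, z, gamma):
    """k_gamma(x, z) = exp(-gamma * ||x - z||_2^2), computed on the fly."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def sample_svm_arms(rng, per_param=10):
    """Return (lambda, gamma) pairs: per_param log-uniform draws of each."""
    lams = [10.0 ** rng.uniform(-6.0, 0.0) for _ in range(per_param)]
    gammas = [10.0 ** rng.uniform(0.0, 3.0) for _ in range(per_param)]
    return list(itertools.product(lams, gammas))

arms = sample_svm_arms(random.Random(0))
print(len(arms))
```

Computing kernel values on demand, rather than storing a precomputed kernel matrix, makes each iteration and each validation pass comparatively expensive, which is where wall-clock differences between allocation strategies show up.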



Figure 5. Kernel SVM. Successive Halving and Successive Rejects
are separated by an order of magnitude in wall-clock time.

5.3. Collaborative filtering 

We next consider a matrix completion problem using the 
Movielens 100k dataset trained using stochastic gradient 



Figure 6. Matrix Completion (bi-convex formulation). 

descent on the bi-convex objective with step sizes as described
in Recht & Ré (2013). To account for the non-convex
objective, we initialize the user and item variables
with entries drawn from a normal distribution with variance
$\sigma^2/d$; hence each arm has hyperparameters $d$ (rank),
$\lambda$ (Frobenius norm regularization), and $\sigma$ (initial conditions).
$d \in [2, 50]$ and $\sigma \in [0.01, 3]$ were chosen uniformly
at random from a linear scale, and $\lambda \in [10^{-6}, 10^{0}]$ was
chosen uniformly at random on a log scale. Each hyperparameter
is given 4 samples, resulting in $4^3 = 64$ total arms.
The experiment was repeated for 32 trials. Error on the val
and test sets was calculated using mean-squared error. One
observes in Figure 6 that the uniform allocation takes two
to eight times longer to achieve a particular error rate than
Successive Halving or Successive Rejects.
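The arm construction and the random initialization for the matrix completion experiment might look like the sketch below. The zero mean of the initial entries, the integer sampling of the rank, and all function names are assumptions made for illustration.

```python
import itertools
import math
import random

def sample_cf_arms(rng, per_param=4):
    """Return (d, sigma, lambda) triples: 4 draws each, 4**3 = 64 arms."""
    ranks = [rng.randint(2, 50) for _ in range(per_param)]          # linear scale
    sigmas = [rng.uniform(0.01, 3.0) for _ in range(per_param)]     # linear scale
    lams = [10.0 ** rng.uniform(-6.0, 0.0) for _ in range(per_param)]  # log scale
    return list(itertools.product(ranks, sigmas, lams))

def init_factors(n_rows, d, sigma, rng):
    """Return an n_rows x d matrix with i.i.d. N(0, sigma**2 / d) entries."""
    std = sigma / math.sqrt(d)
    return [[rng.gauss(0.0, std) for _ in range(d)] for _ in range(n_rows)]

rng = random.Random(0)
arms = sample_cf_arms(rng)
users = init_factors(2000, 10, 1.5, rng)
```

Scaling the variance by 1/d keeps the expected squared norm of each factor row at roughly sigma squared regardless of the rank.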

6. Future directions 

Our theoretical results are presented in terms of $\max_i \gamma_i(t)$.
An interesting future direction is to consider algorithms and
analyses that take into account the specific convergence
rates $\gamma_i(t)$ of each arm, analogous to considering arms
with different variances in the stochastic case (Kaufmann
et al., 2014). Incorporating pairwise switching costs into
the framework could model the time of moving very large
intermediate models in and out of memory to perform
iterations, along with the degree to which resources are shared
across various models (resulting in lower switching costs).
Finally, balancing solution quality and time by adaptively
sampling hyperparameters as is done in Bayesian methods
is of considerable practical interest.
A. Proof of Theorem 1

Proof. For notational ease, define $[\cdot] = \{\{\cdot\}_{t=1}^{\infty}\}_{i=1}^{n}$ so that $[\ell_{i,t}] = \{\{\ell_{i,t}\}_{t=1}^{\infty}\}_{i=1}^{n}$. Without loss of generality, we may
assume that the $n$ infinitely long loss sequences $[\ell_{i,t}]$ with limits $\{\nu_i\}_{i=1}^{n}$ were fixed prior to the start of the game so that
the $\gamma_i(t)$ envelopes are also defined for all time and are fixed. Let $\Omega$ be the set that contains all possible sets of $n$ infinitely
long sequences of real numbers with limits $\{\nu_i\}_{i=1}^{n}$ and envelopes $[\bar{\gamma}(t)]$, that is,

$$
\Omega = \Big\{ [\ell'_{i,t}] : \big[\, |\ell'_{i,t} - \nu_i| \leq \bar{\gamma}(t) \,\big] \wedge \lim_{\tau \to \infty} \ell'_{i,\tau} = \nu_i \ \ \forall i \Big\}
$$

where we recall that $\wedge$ is read as "and" and $\vee$ is read as "or." Clearly, $[\ell_{i,t}]$ is a single element of $\Omega$.

We present a proof by contradiction. We begin by considering the singleton set containing $[\ell_{i,t}]$ under the assumption
that the Successive Halving algorithm fails to identify the best arm, i.e., $S_{\lceil \log_2(n) \rceil} \neq \{1\}$. We then consider a sequence of
subsets of $\Omega$, with each one contained in the next. The proof is completed by showing that the final subset in our sequence
(and thus our original singleton set of interest) is empty when $B > z$, which contradicts our assumption and proves the
statement of our theorem.

To reduce clutter in the following arguments, it is understood that $S'_k$ for all $k$ in the following sets is a function of $[\ell'_{i,t}]$ in
the sense that it is the state of $S_k$ in the algorithm when it is run with losses $[\ell'_{i,t}]$. We now present our argument in detail,
starting with the singleton set of interest, and using the definition of $S_k$ in Figure 3.


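$$
\begin{aligned}
&\Big\{ [\ell'_{i,t}] \in \Omega : [\ell'_{i,t} = \ell_{i,t}] \wedge S'_{\lceil \log_2(n) \rceil} \neq \{1\} \Big\} \\
&= \Big\{ [\ell'_{i,t}] \in \Omega : [\ell'_{i,t} = \ell_{i,t}] \wedge \bigvee_{k=1}^{\lceil \log_2(n) \rceil} \big\{ 1 \notin S'_k,\ 1 \in S'_{k-1} \big\} \Big\} \\
&= \Big\{ [\ell'_{i,t}] \in \Omega : [\ell'_{i,t} = \ell_{i,t}] \wedge \bigvee_{k=0}^{\lceil \log_2(n) \rceil - 1} \Big\{ \sum_{i \in S'_k} \mathbf{1}\{ \ell'_{i,R_k} < \ell'_{1,R_k} \} > \lfloor |S'_k|/2 \rfloor \Big\} \Big\} \\
&= \Big\{ [\ell'_{i,t}] \in \Omega : [\ell'_{i,t} = \ell_{i,t}] \wedge \bigvee_{k=0}^{\lceil \log_2(n) \rceil - 1} \Big\{ \sum_{i \in S'_k} \mathbf{1}\{ \nu_i - \nu_1 < \ell'_{1,R_k} - \nu_1 - \ell'_{i,R_k} + \nu_i \} > \lfloor |S'_k|/2 \rfloor \Big\} \Big\} \\
&\subseteq \Big\{ [\ell'_{i,t}] \in \Omega : \bigvee_{k=0}^{\lceil \log_2(n) \rceil - 1} \Big\{ \sum_{i \in S'_k} \mathbf{1}\{ \nu_i - \nu_1 < |\ell'_{1,R_k} - \nu_1| + |\ell'_{i,R_k} - \nu_i| \} > \lfloor |S'_k|/2 \rfloor \Big\} \Big\} \\
&\subseteq \Big\{ [\ell'_{i,t}] \in \Omega : \bigvee_{k=0}^{\lceil \log_2(n) \rceil - 1} \Big\{ \sum_{i \in S'_k} \mathbf{1}\{ 2\bar{\gamma}(R_k) > \nu_i - \nu_1 \} > \lfloor |S'_k|/2 \rfloor \Big\} \Big\},
\end{aligned}
\tag{1}
$$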


where the last set relaxes the original equality condition to just considering the maximum envelope $\bar{\gamma}$ that is encoded in
$\Omega$. The summation in Eq. 1 only involves the $\nu_i$, and this summand is maximized if each $S'_k$ contains the first $|S'_k|$ arms.
Hence we have,


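$$
\begin{aligned}
(1) &\subseteq \Big\{ [\ell'_{i,t}] \in \Omega : \bigvee_{k=0}^{\lceil \log_2(n) \rceil - 1} \Big\{ \sum_{i=1}^{|S'_k|} \mathbf{1}\{ 2\bar{\gamma}(R_k) > \nu_i - \nu_1 \} > \lfloor |S'_k|/2 \rfloor \Big\} \Big\} \\
&= \Big\{ [\ell'_{i,t}] \in \Omega : \bigvee_{k=0}^{\lceil \log_2(n) \rceil - 1} \big\{ 2\bar{\gamma}(R_k) > \nu_{\lfloor |S'_k|/2 \rfloor + 1} - \nu_1 \big\} \Big\} \\
&\subseteq \Big\{ [\ell'_{i,t}] \in \Omega : \bigvee_{k=0}^{\lceil \log_2(n) \rceil - 1} \Big\{ R_k < \bar{\gamma}^{-1}\Big( \frac{\nu_{\lfloor |S'_k|/2 \rfloor + 1} - \nu_1}{2} \Big) \Big\} \Big\},
\end{aligned}
\tag{2}
$$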


where we use the definition of $\bar{\gamma}^{-1}$. Next, we recall that $R_k = \sum_{j=0}^{k} \Big\lfloor \frac{B}{|S_j| \lceil \log_2(n) \rceil} \Big\rfloor > \frac{B/2}{(\lfloor |S_k|/2 \rfloor + 1) \lceil \log_2(n) \rceil} - 1$
since $|S_k| \leq 2(\lfloor |S_k|/2 \rfloor + 1)$. We note that we are underestimating by almost a factor of 2 to account for integer effects in
favor of a simpler form. By plugging in this value for $R_k$ and rearranging we have that

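$$
\begin{aligned}
(2) &\subseteq \Big\{ [\ell'_{i,t}] \in \Omega : \bigvee_{k=0}^{\lceil \log_2(n) \rceil - 1} \Big\{ \frac{B/2}{\lceil \log_2(n) \rceil} < (\lfloor |S'_k|/2 \rfloor + 1) \Big( 1 + \bar{\gamma}^{-1}\Big( \frac{\nu_{\lfloor |S'_k|/2 \rfloor + 1} - \nu_1}{2} \Big) \Big) \Big\} \Big\} \\
&= \Big\{ [\ell'_{i,t}] \in \Omega : \frac{B/2}{\lceil \log_2(n) \rceil} < \max_{k=0,\dots,\lceil \log_2(n) \rceil - 1} (\lfloor |S'_k|/2 \rfloor + 1) \Big( 1 + \bar{\gamma}^{-1}\Big( \frac{\nu_{\lfloor |S'_k|/2 \rfloor + 1} - \nu_1}{2} \Big) \Big) \Big\} \\
&\subseteq \Big\{ [\ell'_{i,t}] \in \Omega : B < 2 \lceil \log_2(n) \rceil \max_{i=2,\dots,n} i \Big( \bar{\gamma}^{-1}\Big( \frac{\nu_i - \nu_1}{2} \Big) + 1 \Big) \Big\} = \emptyset
\end{aligned}
$$

where the last equality holds if $B > z$.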

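The second, looser, but perhaps more interpretable form of $z$ is thanks to Audibert & Bubeck (2010), who showed that

$$
\max_{i=2,\dots,n} i\, \bar{\gamma}^{-1}\Big( \frac{\nu_i - \nu_1}{2} \Big) \leq \sum_{i=2,\dots,n} \bar{\gamma}^{-1}\Big( \frac{\nu_i - \nu_1}{2} \Big) \leq \log_2(2n) \max_{i=2,\dots,n} i\, \bar{\gamma}^{-1}\Big( \frac{\nu_i - \nu_1}{2} \Big)
$$

where both inequalities are achievable with particular settings of the $\nu_i$ variables.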


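B. Proof of Theorem 2

Proof. Recall the notation from the proof of Theorem 1 and let $\hat{i}([\ell'_{i,t}])$ be the output of the uniform allocation strategy
with input losses $[\ell'_{i,t}]$.

$$
\begin{aligned}
\Big\{ [\ell'_{i,t}] \in \Omega : [\ell'_{i,t} = \ell_{i,t}] \wedge \hat{i}([\ell'_{i,t}]) \neq 1 \Big\}
&= \Big\{ [\ell'_{i,t}] \in \Omega : [\ell'_{i,t} = \ell_{i,t}] \wedge \ell'_{1,B/n} > \min_{i=2,\dots,n} \ell'_{i,B/n} \Big\} \\
&\subseteq \Big\{ [\ell'_{i,t}] \in \Omega : 2\bar{\gamma}(B/n) > \min_{i=2,\dots,n} \nu_i - \nu_1 \Big\} \\
&= \Big\{ [\ell'_{i,t}] \in \Omega : 2\bar{\gamma}(B/n) > \nu_2 - \nu_1 \Big\} \\
&\subseteq \Big\{ [\ell'_{i,t}] \in \Omega : B < n \bar{\gamma}^{-1}\Big( \frac{\nu_2 - \nu_1}{2} \Big) \Big\} = \emptyset
\end{aligned}
$$

where the last equality follows from the fact that $B > z$, which implies $\hat{i}([\ell_{i,t}]) = 1$.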


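C. Proof of Theorem 3

Proof. Let $\beta(t)$ be an arbitrary, monotonically decreasing function of $t$ with $\lim_{t \to \infty} \beta(t) = 0$. Define $\ell_{1,t} = \nu_1 + \beta(t)$
and $\ell_{i,t} = \nu_i - \beta(t)$ for all $i > 1$. Note that for all $i$, $\gamma_i(t) = \bar{\gamma}(t) = \beta(t)$ so that

$$
\begin{aligned}
\hat{i} = 1 &\iff \ell_{1,B/n} < \min_{i=2,\dots,n} \ell_{i,B/n} \\
&\iff \nu_1 + \bar{\gamma}(B/n) < \min_{i=2,\dots,n} \nu_i - \bar{\gamma}(B/n) \\
&\iff \nu_1 + \bar{\gamma}(B/n) < \nu_2 - \bar{\gamma}(B/n) \\
&\iff \bar{\gamma}(B/n) < \frac{\nu_2 - \nu_1}{2} \\
&\iff B \geq n \bar{\gamma}^{-1}\Big( \frac{\nu_2 - \nu_1}{2} \Big). \qquad \blacksquare
\end{aligned}
$$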
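The construction in this proof can be checked numerically with a small sketch. The choices below are assumptions made purely for illustration: $\beta(t) = 1/t$, three arms with limits 0.1, 0.3, 0.5, zero-based arm indices (so index 0 is the best arm), and the helper names `make_losses` and `uniform_allocation`.

```python
def make_losses(nu, beta):
    """Best arm (index 0) approaches nu[0] from above; all others from below."""
    def loss(i, t):
        return nu[i] + beta(t) if i == 0 else nu[i] - beta(t)
    return loss

def uniform_allocation(loss, n, budget):
    """Pull every arm budget // n times; return the arm with the smallest loss."""
    t = budget // n
    return min(range(n), key=lambda i: loss(i, t))

nu = [0.1, 0.3, 0.5]
loss = make_losses(nu, beta=lambda t: 1.0 / t)

# (nu_2 - nu_1) / 2 = 0.1, so the best arm wins only once beta(B/n) < 0.1.
print(uniform_allocation(loss, 3, 27))   # 9 pulls per arm, beta = 1/9 > 0.1
print(uniform_allocation(loss, 3, 36))   # 12 pulls per arm, beta = 1/12 < 0.1
```

With 9 pulls per arm the second-best arm still looks better and is returned; with 12 pulls per arm the envelope has shrunk below half the gap and the best arm is returned.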
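D. Proof of Theorem 4

We can guarantee for the Successive Halving algorithm of Figure 3 that the output arm $\hat{i}$ satisfies

$$
\begin{aligned}
\nu_{\hat{i}} - \nu_1 &= \min_{i \in S_{\lceil \log_2(n) \rceil}} \nu_i - \nu_1 \\
&= \sum_{k=0}^{\lceil \log_2(n) \rceil - 1} \Big( \min_{i \in S_{k+1}} \nu_i - \min_{i \in S_k} \nu_i \Big) \\
&\leq \sum_{k=0}^{\lceil \log_2(n) \rceil - 1} \Big( \min_{i \in S_{k+1}} \ell_{i,R_k} - \min_{i \in S_k} \ell_{i,R_k} + 2\bar{\gamma}(R_k) \Big) \\
&= \sum_{k=0}^{\lceil \log_2(n) \rceil - 1} 2\bar{\gamma}(R_k) \leq \lceil \log_2(n) \rceil \, 2\bar{\gamma}\Big( \Big\lfloor \frac{B}{n \lceil \log_2(n) \rceil} \Big\rfloor \Big)
\end{aligned}
$$

simply by inspecting how the algorithm eliminates arms and plugging in a trivial lower bound for $R_k$ for all $k$ in the last
step.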
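To make the quantities in this bound concrete, here is a minimal sketch of Successive Halving as the proofs use it: in round $k$ each surviving arm receives $\lfloor B / (|S_k| \lceil \log_2(n) \rceil) \rfloor$ additional pulls and the better half survives. Figure 3 remains the authoritative statement of the algorithm. Rounding the kept half up, the zero-based arm indices, the `loss(i, t)` interface, and the example losses with envelope $1/t$ are all assumptions.

```python
import math

def successive_halving(loss, n, budget):
    """loss(i, t): loss of arm i after t cumulative pulls. Returns the surviving arm."""
    rounds = math.ceil(math.log2(n))
    survivors = list(range(n))
    pulls = 0                                           # R_k, cumulative pulls per arm
    for _ in range(rounds):
        pulls += budget // (len(survivors) * rounds)    # r_k
        survivors.sort(key=lambda i: loss(i, pulls))
        survivors = survivors[: math.ceil(len(survivors) / 2)]  # assumed rounding
    return survivors[0]

nu = [0.1, 0.3, 0.5, 0.7]

def loss(i, t):
    # Hypothetical losses with envelope 1/t: arm 0 converges from above, the rest from below.
    return nu[i] + 1.0 / t if i == 0 else nu[i] - 1.0 / t

best = successive_halving(loss, 4, 200)
print(best)   # 0
```

In this example the first round uses 25 pulls per arm, so the guarantee above evaluates to $2 \cdot 2 \cdot (1/25) = 0.16$, and the returned arm (the best one, with gap 0) satisfies it.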




